Adaptable Wrapper Generation for Web Page Format Change
Abstract
In this paper, we propose an adaptive wrapper generator that produces adaptable wrappers able to cope with format changes in networked information sources (NIS). When the format of a target NIS changes, the adaptable wrapper starts a recovery phase to discover the extraction rule for the new format. The wrapper automatically adapts to changes in the content tags and continues to extract information accurately. The wrapper was evaluated on 6 websites covering 3 kinds of NIS, and the results show an average precision above 98%. We conclude that the generated adaptable wrapper can adapt to NIS format changes and extract information accurately.
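The abstract does not spell out how the recovery phase works, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the extraction rule is simply the name of the HTML tag enclosing each value, and that recovery relocates a previously extracted value in the changed page to learn the new enclosing tag. The names `extract`, `recover_tag`, and `AdaptableWrapper` are hypothetical.

```python
import re


def extract(page, tag):
    """Apply the extraction rule: return the text of every <tag>...</tag>."""
    pattern = r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag))
    return [text.strip() for text in re.findall(pattern, page, flags=re.S)]


def recover_tag(page, known_value):
    """Assumed recovery step: find the tag that now directly encloses a
    value the wrapper extracted before the format change."""
    pattern = r"<(\w+)\b[^>]*>\s*" + re.escape(known_value) + r"\s*</\1>"
    match = re.search(pattern, page)
    return match.group(1) if match else None


class AdaptableWrapper:
    """Toy wrapper whose rule is a single content tag (an assumption)."""

    def __init__(self, tag):
        self.tag = tag
        self.last_values = []  # values from the last successful extraction

    def run(self, page):
        values = extract(page, self.tag)
        if not values and self.last_values:
            # The rule no longer matches: enter the recovery phase.
            for known in self.last_values:
                new_tag = recover_tag(page, known)
                if new_tag:
                    self.tag = new_tag
                    values = extract(page, self.tag)
                    break
        if values:
            self.last_values = values
        return values


old_page = "<html><b>Widget A</b><b>Widget B</b></html>"
new_page = "<html><span class='t'>Widget A</span><span class='t'>Widget C</span></html>"

wrapper = AdaptableWrapper("b")
print(wrapper.run(old_page))  # ['Widget A', 'Widget B']
print(wrapper.run(new_page))  # ['Widget A', 'Widget C']
print(wrapper.tag)            # span
```

This sketch recovers only when at least one previously seen value survives the format change. A real system would need a more robust way to recognise the target content.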
Related resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we rely mainly on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Automatic Wrapper Generation and Maintenance
This paper investigates automatic wrapper generation and maintenance for forum, blog, and news web sites. Web pages are increasingly generated dynamically from a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate wrappers for this kind of web page. The tree alignment algorithm is adopted...
A Multi-Page Data Extraction Service
We present a service-oriented architecture and a set of techniques for developing wrapper code generators, including the methodology of designing an effective wrapper program construction facility and a concrete implementation, called XWRAPComposer. Our wrapper generation framework has two unique design goals. First, we explicitly separate tasks of building wrappers that are specific to a Web s...
Wrapper Maintenance
A Web wrapper is a software application that extracts information from a semi-structured source and converts it to a structured format. While semi-structured sources, such as Web pages, contain no explicitly specified schema, they do have an implicit grammar that can be used to identify relevant information in the document. A wrapper learning system analyzes page layout to generate either gramm...
Recognizing Structure in Web Pages using Similarity Queries
We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful"--i.e., a structure that was used in a han...